---
title: "Assignment 7"
output:
  flexdashboard::flex_dashboard:
    theme:
      version: 4
      bootswatch: default
      navbar-bg: "purple"
    orientation: columns
    vertical_layout: fill
    source_code: embed
---
```{r setup, include=FALSE}
library(flexdashboard)
library(tidyverse)
library(DT)
insurance <- read_csv("./insurance.csv")
```

Overview
===

Column {data-width=550}
---

### <b><font size = 4><span Style = "color:black">Data Set Observations</span></font></b>

```{r table}
datatable(
  insurance,
  rownames = FALSE,
  colnames = c("age", "sex", "bmi", "children", "smoker", "region", "charges"),
  options = list(pageLength = 20)
)
```


Column {data-width=450}
---

### Overall Analysis

This is the personal medical cost data set, with the variables age, sex, bmi, children, smoker, region, and charges. The southeast region has the most observations, while the northeast, northwest, and southwest regions have roughly equal counts. The body mass index (BMI) distribution looks approximately normal, and the BMI boxplots show outliers in every region except the northwest. The charges distribution is heavily right-skewed. Non-smokers generally have lower charges than smokers, although some non-smokers still have charges in the 20k to 30k range. A fitted line is a reasonable way to summarize the relationship between age and charges because it shows the underlying trend, though it may miss more complex patterns and outliers. The largest share of individuals in the data set have no children, and the percentage decreases as the number of children increases. In the boxplots of charges by number of children, outliers are most noticeable in the 0-children and 1-child groups. Overall, the plots of charges are right-skewed.


Bar Plot
===

Column {data-width=500}
---

```{r Barplot}
ggplot(insurance, aes(x = region, fill = region)) +
  geom_bar() +
  ggtitle("Distribution of Region") +
  xlab("Region") +
  ylab("Count") +
  theme_minimal()
```


Column {data-width=500}
---

### Analysis

The southeast region has the most observations in the data set. The northeast, northwest, and southwest regions have roughly equal counts.


Stack Bar Plot
===

Column {data-width=1000}
---

```{r Bar Stack}
ggplot(insurance, aes(x = region, fill = smoker)) +
  geom_bar(position = "fill") +
  # position = "fill" plots proportions (0 to 1); show them as percentages
  scale_y_continuous(labels = scales::percent) +
  ggtitle("Smoker Distribution in Each Region") +
  xlab("Region") +
  ylab("Percentage") +
  theme_minimal()
```


Histogram bmi
===

Column {data-width=500}
---

```{r Histogram bmi}
ggplot(insurance, aes(x = bmi)) +
  geom_histogram(binwidth = 2, fill = "blue", color = "black") +
  ggtitle("BMI Distribution") +
  xlab("BMI") +
  ylab("Count") +
  theme_minimal()
```


Column {data-width=500}
---

### Analysis

The histogram of BMI looks approximately normally distributed.

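
One way to back up the visual impression from the histogram is to compare the sorted values against normal quantiles. The sketch below is a minimal base R illustration, not part of the dashboard: `qq_correlation` is a hypothetical helper name, and a correlation near 1 only suggests approximate normality rather than proving it.

```r
# Correlation between the sample values and the matching normal quantiles.
# `qq_correlation` is a hypothetical helper written for this sketch.
# Values close to 1 suggest the data are roughly normal.
qq_correlation <- function(x) {
  x <- x[!is.na(x)]                        # drop missing values first
  qq <- stats::qqnorm(x, plot.it = FALSE)  # theoretical vs sample quantiles
  stats::cor(qq$x, qq$y)
}

# Intended use with the dashboard data (assumes `insurance` is loaded):
# qq_correlation(insurance$bmi)
```

A value clearly below 1 would be a reason to look at the histogram again before describing BMI as normal.
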

Histogram Charges
===

Column {data-width=500}
---

```{r Histogram Charges}
ggplot(insurance, aes(x = charges)) +
  geom_histogram(binwidth = 1000, fill = "green", color = "black") +
  ggtitle("Charges Distribution") +
  xlab("Charges") +
  ylab("Count") +
  theme_minimal()
```


Column {data-width=500}
---

### Analysis

The histogram of charges is heavily right-skewed.

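
The skew can also be put into a number. The sketch below computes a simple moment-based skewness in base R, where a positive value means a longer right tail. `sample_skewness` is a hypothetical helper name, and this is only one of several common skewness formulas.

```r
# Moment-based sample skewness: positive values indicate a longer right tail.
# `sample_skewness` is a hypothetical helper written for this sketch.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]            # drop missing values
  centered <- x - mean(x)      # deviations from the mean
  mean(centered^3) / mean(centered^2)^1.5
}

# Intended use with the dashboard data (assumes `insurance` is loaded):
# sample_skewness(insurance$charges)
# sample_skewness(log(insurance$charges))  # a log scale may reduce the skew
```

Comparing the raw and log-transformed values is a quick way to see whether a transformation would be worth considering before modeling charges.
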

Boxplot bmi
===

Column {data-width=500}
---

```{r Boxplot bmi}
ggplot(insurance, aes(x = region, y = bmi, fill = region)) +
  geom_boxplot() +
  ggtitle("Distribution of BMI Based on Region") +
  xlab("Region") +
  ylab("BMI") +
  theme_minimal()
```


Column {data-width=500}
---

### Analysis

The northwest region is the only boxplot without an outlier. The BMI distribution in every region looks approximately normal. The southeast region has a larger interquartile range than the other regions.

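
The outliers drawn by `geom_boxplot()` are, by default, the points more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule in base R so the outliers can be counted per region instead of read off the plot. Both helper names are hypothetical and were written for this illustration.

```r
# Flag values outside the boxplot whiskers using the 1.5 * IQR rule.
# `flag_boxplot_outliers` is a hypothetical helper, not part of ggplot2.
flag_boxplot_outliers <- function(x) {
  q <- stats::quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
}

# Count flagged values within each level of a grouping variable.
# `count_outliers_by_group` is also hypothetical.
count_outliers_by_group <- function(values, groups) {
  tapply(values, groups, function(v) sum(flag_boxplot_outliers(v), na.rm = TRUE))
}

# Intended use with the dashboard data (assumes `insurance` is loaded):
# count_outliers_by_group(insurance$bmi, insurance$region)
```

A count of 0 for a region would match a boxplot with no outlier points drawn.
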

Scatterplot
===

Column {.tabset data-width=550}
---

### Scatterplot 1

```{r Scatterplot 1}
ggplot(insurance, aes(x = age, y = charges)) +
  geom_point() +
  ggtitle("Relationship between Age and Charges") +
  xlab("Age") +
  ylab("Charges") +
  theme_minimal()
```

### Scatterplot 2

```{r Scatterplot 2}
ggplot(insurance, aes(x = age, y = charges, color = smoker)) +
  geom_point() +
  ggtitle("Relationship between Age, Charges, and Smoker Status") +
  xlab("Age") +
  ylab("Charges") +
  theme_minimal()
```


Column {data-width=450}
---

### Analysis of Scatterplot 1

In some areas of the plot the points form about three distinct groups. Charges also appear to increase with age.

### Analysis of Scatterplot 2

Overall, non-smokers had lower charges than smokers. However, some non-smokers still have charges in the 20k to 30k range.


Scatterplot Extra
===

```{r}
smoker <- insurance[insurance$smoker == "yes", ]
nonsmoker <- insurance[insurance$smoker == "no", ]
```

Column {.tabset data-width=550}
---

### Scatterplot 1

```{r}
ggplot(smoker, aes(x = age, y = charges)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  ggtitle("Relationship between Age and Charges for Smokers") +
  xlab("Age") +
  ylab("Charges") +
  theme_minimal()
```

### Scatterplot 2

```{r}
ggplot(nonsmoker, aes(x = age, y = charges)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  ggtitle("Relationship between Age and Charges for Non-Smokers") +
  xlab("Age") +
  ylab("Charges") +
  theme_minimal()
```


Column {data-width=450}
---

### Analysis of Scatterplot 1

The fitted line (a linear fit from `geom_smooth(method = "lm")`) runs between the two groups of points. I believe it makes sense to use the line to summarize the relationship between the age of clients and their charges. It captures the underlying trend in the data and gives an initial insight into the relationship. However, it may not capture more complex patterns and outliers.

### Analysis of Scatterplot 2

In this case the line sits closer to most of the points. A minority of points are scattered above the line.

### Overall

There are several things you could do to model charges using other variables in this data. You could identify and handle outliers that might distort the results. You could also estimate the effect of categorical variables such as sex or region on charges.

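
As a rough starting point for that kind of modeling, the sketch below fits a linear model of charges on the other columns, including the categorical ones. It is a minimal illustration under assumptions: `fit_charges_model` is a hypothetical helper name, and the formula is just one possible choice, not a tuned or validated model.

```r
# Fit a linear model of charges on the other variables in the data.
# `fit_charges_model` is a hypothetical helper; the formula is one possible
# starting point, not a tuned or validated model.
fit_charges_model <- function(data) {
  stats::lm(charges ~ age + bmi + children + smoker + sex + region, data = data)
}

# Intended use with the dashboard data (assumes `insurance` is loaded):
# fit <- fit_charges_model(insurance)
# summary(fit)  # coefficients for sex and region estimate their effect on charges
```

Character columns such as `smoker`, `sex`, and `region` are treated as factors by `lm()`, so each non-baseline level gets its own coefficient.
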

Pie Chart
===

Column {data-width=500}
---

```{r Pie Chart}
children_count <- count(insurance, children)
children_count$percent <- round(children_count$n / sum(children_count$n) * 100, 2)
pie_chart <- ggplot(children_count, aes(x = "", y = percent, fill = factor(children))) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar("y", start = 0) +
  geom_text(
    aes(label = paste0(children, "\n", percent, "%")),
    fontface = "bold", color = "black",
    position = position_stack(vjust = 0.5)
  ) +
  scale_fill_brewer(palette = "Oranges") +
  theme_void() +
  theme(text = element_text(size = 20)) +
  labs(fill = "Number of Children")
pie_chart
```


Column {data-width=500}
---

### Analysis

The largest share of people in the data have 0 children. The percentage decreases as the number of children increases.


Boxplot Extra
===

Column {data-width=500}
---

```{r Boxplot Extra}
ggplot(insurance, aes(x = factor(children), y = charges)) +
  geom_boxplot() +
  ggtitle("Distribution of Charges Based on Number of Children") +
  xlab("Number of Children") +
  ylab("Charges") +
  theme_minimal()
```


Column {data-width=500}
---

### Analysis

There are many outliers, with far more in the 0-children and 1-child groups than in the rest. All of the boxplots also look right-skewed.